Realistic parameters are attainable in spite of missing data. DOVE can be
useful, even when many or most data are missing, for (1) generalized least
squares fitting to evaluate a self-consistent set of all parameters in an
expression for predicting all missing data, and (2), without changing the
predicted data, to transform the set of parameters obtained in phase 1 so that
each final parameter has a simple, pure, realistic, physical meaning. Since
predicted data are expressed as $s_{i1}f_{1j} + s_{i2}f_{2j} + \ldots + c_i$ with n product terms,
phase 2 requires incorporation of $n^2+n$ independent subsidiary conditions, of
which 2n are arbitrary, i.e., merely fix zero reference points and scale unit
sizes, but $n^2-n$ are critical, i.e., must be relationships between particular
parameters supported by other information. Both phases are illustrated by a
two-mode application with 7 i, 10 j, hence 41 parameters, to fit the data plus
the 6 subsidiary conditions. Valid parameters are obtained although 30 of the
70 possible data are missing.

DOVE is a handy procedure for predicting missing data and forcing every
parameter in the fitted expression to have a simple, discrete, realistic, physical meaning.
The acronym DOVE, standing for "dual obligate vector evaluation", refers to its
two-phase evaluation of all parameters, obligating them by least squares in phase 1,
and additionally obligating them in phase 2 by subsidiary conditions that are
supported by information other than the data (10).


Phase  1 

Equation 1 embodies the least squares criterion of fit.

$$\sum_i \sum_j e_{ij}\, w_i\, (z_{ij} - p_{ij})^2 = \text{minimum} \qquad (1)$$

Here z and p refer to observed and predicted data, j specifies the variable of
main interest, i specifies all other variables, $e_{ij}$ is unity if $z_{ij}$ exists but
zero for any ij combination not observed, and $w_i$ are suitable statistical weights.
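
As a minimal sketch of how the criterion of eq. 1 can be evaluated when some data are missing, the following Python/NumPy function sums weighted squared residuals over the observed entries only. The function name `dove_criterion`, the array layout (rows for i, columns for j), the use of one weight per row, and the sample numbers are assumptions made for illustration; they are not taken from the authors' program.

```python
import numpy as np

def dove_criterion(z, p, e, w):
    """Weighted sum of squared residuals over observed data only (eq. 1).

    z, p : (u, v) arrays of observed and predicted data (rows = i, columns = j)
    e    : (u, v) array of 0/1 existence flags
    w    : (u,) array of statistical weights, one per row (an assumption)
    """
    # Missing entries may hold NaN; replace them so they cannot contaminate the sum.
    z = np.where(e == 1, z, 0.0)
    p = np.where(e == 1, p, 0.0)
    return float(np.sum(e * w[:, None] * (z - p) ** 2))

# Hypothetical 2 x 3 data set with one missing observation.
z = np.array([[1.0, 2.0, np.nan], [0.5, 1.5, 2.5]])
p = np.array([[1.1, 1.9, 7.0], [0.5, 1.0, 2.5]])
e = np.array([[1, 1, 0], [1, 1, 1]])
w = np.array([1.0, 2.0])
print(dove_criterion(z, p, e, w))
```

The prediction 7.0 at the missing position has no effect on the result, which is the role played by the existence flags in eq. 1.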

Equation 2 is a generalized form of a widely applicable expression for
predicted data.

$$p_{ij} = \sum_{m=1}^{n} s_{im} f_{mj} + c_i \qquad (2)$$

Its parameters comprise factors f, slopes s, and intercepts c (2). However, the
confusion of double subscripts on factors and slopes can be avoided by a notation
using different factor and slope symbols for each different product term or mode m.
Therefore we will switch to expressions for $p_{ij}$ such as

$$p_{ij} = c_i \qquad (3)$$

$$p_{ij} = a_i x_j + c_i \qquad (4)$$

$$p_{ij} = a_i x_j + b_i y_j + c_i \qquad (5)$$

$$p_{ij} = a_i x_j + b_i y_j + s_{i3} f_{3j} + c_i \qquad (6)$$

or an equation with even more modes as soon as we have decided on the number, n,
of modes to include.

The subscripts, i and j, need elucidation. Subscript j refers to the main or
primary variable, while subscript i refers to all other variables. To be more
precise, j is a numerical index for a specific example of the principal variable.
In the past, this specific example has been variously called a case, individual,
object, entity, or unit. Since most of these names are ambiguous or cumbersome,
we will call it a "jot". Subscript i is a numerical index for a group having a
common set of all the other variables. This group has also been called a variable,
attribute, characteristic, property, class, or series. We will call it an "ilk"
(2). For example, in a study of solvent effects the main variable is the solvent.
A j of 1 might denote that water is the jot, while a j of 2 might identify the jot
as ethyl alcohol. An i of 1 might refer to an ilk composed of logs of rate constants
for a particular reaction at 25°C in all the solvents in which it has been studied,
while i = 2 might mean an ilk of spectral measurements of the frequency for a
particular electronic transition of a particular compound in different solvents.

Equations 4, 5 or 6 might suggest that we are only fitting a line, plane or
hyperplane, respectively. If the factors ($x_j$, $y_j$, ...) were all known in advance,
this would indeed be only a straightforward linear regression analysis to evaluate
the i-subscripted parameters. However, if any of these factors are unknown, the
observed data must be used to determine j-subscripted parameters as well as
i-subscripted parameters. Thus, in general this is a nonlinear rather than a linear
least squares problem. Furthermore, the phase 1 least squares is more general
than linear for another reason: any of the factors produced could prove to be a
nonlinear function of one of the other factors or of several of them.


Phase  2 


The least squares condition, eq. 1, is not, in general, sufficient to determine
the parameters $s_{im}$ and $f_{mj}$ uniquely. For example, if $p_{ij}$ satisfies eq. 4, then all
the values of $a_i$ could be doubled while all the values of $x_j$ are halved without
affecting the values of $p_{ij}$ or the criterion of fit of eq. 1.

Therefore we propose to follow the tradition (embodied in the Brønsted catalysis
law and the Hammett equation) of making the factors represent conceptually simple
physical influences of jots rather than only a compact means for representing or
predicting data. For this purpose we usually need to transform all phase 1 parameters
into new ones having a simpler and clearer interpretation, by incorporating a number
of physically meaningful, independent, subsidiary conditions corroborated by other
information than the data.

Such transformations are far from obvious when expressions as complicated as
eq. 5 or 6 hold. In fact, the interpretation of observed or measured data is then
always confounded and invalid conclusions about modes and parameters have usually
been drawn, because the jot affects the system under study by two or more
mechanisms of interaction rather than one, and the relative importance of the n
mechanisms changes with both the jot and the ilk (11).

The total number of necessary subsidiary conditions for n=2 (equation 5)
is derived below, as an example. The derivation from eq. 2 for any other number
of modes is similar. First express all $p_{ij}$ (predicted data from a converged
least-squares solution) as the product of a row vector $I_i'$ of the i-subscripted
parameters times a column vector $J_j'$ of the j-subscripted parameters and unity (6).
The primes indicate values calculated in phase 1.

$$p_{ij} = I_i' J_j' = (a_i' \;\; b_i' \;\; c_i') \begin{pmatrix} x_j' \\ y_j' \\ 1 \end{pmatrix}$$

No individual $p_{ij}$ is changed by insertion of the 3x3 unity (identity) matrix
or its equal $T^{-1}T$ between $I_i'$ and $J_j'$,

$$p_{ij} = I_i' J_j' = I_i' T^{-1} T J_j', \qquad T = \begin{pmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \\ 0 & 0 & 1 \end{pmatrix}$$

but the components of the resulting vectors

$$I_i = I_i' T^{-1} = (a_i \;\; b_i \;\; c_i) \qquad (7)$$

$$J_j = T J_j' = \begin{pmatrix} x_j \\ y_j \\ 1 \end{pmatrix} \qquad (8)$$

are constants that are equally good solutions of eq. 1.

$$p_{ij} = I_i J_j = a_i x_j + b_i y_j + c_i \qquad (9)$$

Five transformation equations (10-14, representing each parameter) derived
from equations 7 and 8 convert the old (primed) set of parameters to the new
(unprimed) set. Matrix $T^{-1}$, the inverse of T, was used to derive equations 12-14.

$$x_j = t_{11} x_j' + t_{12} y_j' + t_{13} \qquad (10)$$

$$y_j = t_{21} x_j' + t_{22} y_j' + t_{23} \qquad (11)$$

$$a_i = (t_{22} a_i' - t_{21} b_i')/\det \qquad (12)$$

$$b_i = (t_{11} b_i' - t_{12} a_i')/\det \qquad (13)$$

$$c_i = c_i' + [(t_{12}t_{23} - t_{13}t_{22})\,a_i' + (t_{13}t_{21} - t_{11}t_{23})\,b_i']/\det \qquad (14)$$

$$\det = t_{11}t_{22} - t_{12}t_{21} \qquad (15)$$

Obviously, T must be chosen so that $\det \neq 0$.
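
The invariance of the predicted data under this transformation can be checked numerically. The sketch below, in Python with NumPy, applies an arbitrary nonsingular T (last row 0, 0, 1) to hypothetical two-mode parameters and compares the predictions before and after. The function name `transform`, the random parameter values, and the particular T are illustrative assumptions only.

```python
import numpy as np

def transform(T, a, b, c, x, y):
    """Apply I = I' T^-1 and J = T J' to a set of two-mode parameters.

    a, b, c : (u,) arrays (slopes and intercepts); x, y : (v,) arrays (factors).
    T must have last row (0, 0, 1) and a nonzero determinant.
    """
    I_old = np.column_stack([a, b, c])            # one row vector per i
    J_old = np.vstack([x, y, np.ones_like(x)])    # one column vector per j
    I_new = I_old @ np.linalg.inv(T)
    J_new = T @ J_old
    return I_new[:, 0], I_new[:, 1], I_new[:, 2], J_new[0], J_new[1]

# Hypothetical parameters for 4 values of i and 5 values of j.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 4))
x, y = rng.normal(size=(2, 5))
T = np.array([[2.0, 0.5, -1.0],
              [0.3, 1.5, 0.7],
              [0.0, 0.0, 1.0]])

a2, b2, c2, x2, y2 = transform(T, a, b, c, x, y)
p_old = np.outer(a, x) + np.outer(b, y) + c[:, None]
p_new = np.outer(a2, x2) + np.outer(b2, y2) + c2[:, None]
print(np.allclose(p_old, p_new))
```

Every individual parameter changes, yet the two prediction arrays agree, which is why the data alone cannot single out one parameter set.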


There are six degrees of indeterminacy since six elements $t_{11}$ to $t_{23}$ remain
unspecified. To remove this indeterminacy, we must specify six independent
subsidiary conditions and use them to evaluate these six elements.

Four of the six necessary conditions are trivial in this example where n=2,
because 2 references and 2 standards may be specified arbitrarily. Equating the
factor for one jot in each mode to a reference value (commonly zero) is analogous
to choosing average sea level as a height reference or the freezing point of water
as a temperature reference. Equating a particular factor or slope to a standard
value (never zero, commonly unity) is analogous to choosing the meter as a standard
of length or K as a unit of temperature. It merely fixes the size of the scale
or units in which factors for that mode are expressed.

The remaining two subsidiary conditions are critical ones and should be chosen
with care and clearly stated, because they do have physical meaning and must be
substantiated by other information than the data to ensure that all of the
transformed factors and slopes will be physically simple and meaningful.

In general, the total number of necessary subsidiary conditions is $n^2+n$,
of which 2n are trivial and $n^2-n$ are critical.
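
The count can be tabulated for a few values of n. In the short Python sketch below, the total is taken as the number of unspecified elements of T, which has n rows of n+1 free entries in the n-mode analogue of the 3x3 matrix used above. The function name `subsidiary_conditions` is an illustrative assumption.

```python
def subsidiary_conditions(n):
    """Return (total, trivial, critical) numbers of conditions for n modes."""
    total = n * (n + 1)       # free elements of T: n rows of n + 1 entries
    trivial = 2 * n           # one reference point and one scale unit per mode
    critical = total - trivial
    return total, trivial, critical

for n in (1, 2, 3):
    print(n, subsidiary_conditions(n))
```

For n=2 this gives the six conditions (four trivial, two critical) of the example; for n=3 the number of critical conditions already rises to six.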

The transformation of the parameters is required only once, after convergence
has been reached in phase 1, and is in fact much simpler in program coding and
much faster in computer execution time than any one of the iterative cycles
preceding convergence. Nevertheless, more prior thought and more careful judgement
are required in phase 2 than in phase 1.

There are circumstances where subsidiary conditions and the corresponding
parameter transformation of phase 2 are unnecessary. First, one might want to
know the correlation coefficient between observed z and predicted p data corresponding
to one or more eq. 2 expressions for $p_{ij}$. Neither correlation coefficients nor
$p_{ij}$ values are changed by phase 2. The number of modes could be deduced as the
n value that gives the highest correlation coefficient. Second, one might want
to use one of the expressions, probably the one yielding the highest correlation
coefficient, to estimate unmeasured or missing data. Although principal
components and other standard factor analyses cannot be relied on when there are
missing data (3), DOVE can. However, no meaning or significance can be attached
to the parameters produced by phase 1 other than their ability to predict data,
because they are only one set out of multiple infinities of sets, all
equally good for reproducing the observed data and predicting missing data.

On the other hand, if the required number of critical subsidiary conditions
can be stated and justified as true, phase 2 can be used to sort out realistically
the underlying influences of different jots, and the sensitivities to these
influences in different environments (ilks). These parameters can give considerably
more insight into forces and mechanisms than the measured or predicted data. This
is the intended purpose of factor analysis (2).


An  Example  Using  Equation  5 

DOVE was developed as an essential tool to solve the chemical problem of
separating substituent effects into field and resonance components. After proving
highly successful for this purpose, it was applied to separating numerous solvent
effects into contributions associated with anion solvation and cation solvation.
Both applications will be published separately in chemical journals. However,
we anticipate that DOVE will be as useful or more useful in many other fields of
science, engineering, and management. Since we want to prove that this procedure does
yield correct answers when other methods fail, and to explain it clearly to
encourage its more widespread use, we will illustrate it here by a synthetic but
easily understood geometric example which we used to test the procedure because the
answers are known. This is the problem of using data on 7 properties
(ilks in Table 1) of 10 solid right circular cylinders (of which 3 are pictured
in Figure 1) to evaluate, for each cylinder, the factors (measures or functions
of radius and height) responsible for variations in the data from one cylinder
to another, and to evaluate, for each property, the slopes (relative sensitivities
to these factors) responsible for variations in the data from one property to
another. We are pretending that we have not yet discovered a way to measure radii
and heights, but wish to calculate them from measurements of these 7 other
properties of the 10 cylinders. Otherwise this is a fairly realistic example for
showing the kinds of limitations on such evaluations likely to arise from
inability to measure underlying factors directly.

We converted the data to log data (listed in Table 2) because a DOVE
phase 1 analysis on the raw data gives an overall correlation coefficient of only 0.93
with 2 modes (6 modes would be needed) but logarithms give 1.000 with 2 modes. We
chose this example because this behavior is typical of several real physical
problems where logarithms of measured quantities are more simply interpreted than
the raw data. In chemistry, for example, one uses logs of rate constants or
equilibrium constants in any attempted correlations between structure and reactivity
because they are linear functions of energy differences between structures.
Many sense responses (brightness, loudness, pitch) also appear to be logarithmic
in character.

The input data in Table 2 could have been logs of measured data. However,
instead we calculated them for cylinders having the randomly selected radii and
heights (7) shown in Table 3. Now we will pretend not to know any of the formulas
in Table 1 nor the 20 factors (log r and log h values) nor their 14 slopes in the
logarithmic formulas but proceed to deduce them from Table 2 and subsidiary
conditions only, then check all these deductions by Tables 1 and 3.
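
The way such log data follow from the radii and heights can be sketched in Python with NumPy. The block below evaluates the seven linear log forms of Table 1 for the ten cylinders of Table 3. Base-10 logarithms are assumed (they reproduce the log ratios listed in Table 3), and the array names and layout are illustrative choices rather than the authors' own.

```python
import numpy as np

# Radii and heights of the 10 cylinders, copied from Table 3.
r = np.array([0.021658190, 0.543617487, 0.030803025, 0.521428823, 0.890675187,
              0.796412110, 0.190720081, 0.923573375, 0.175991654, 0.358400822])
h = np.array([0.408169508, 0.456423521, 0.153893471, 0.799775958, 0.930589199,
              0.968748212, 0.059738331, 0.606675744, 0.081358790, 0.323721051])
delta, rho = 10.0, 0.1   # density (g/cm^3) and resistivity (ohm cm) from Table 1

# One row per property: coefficient of log r, coefficient of log h, constant.
coef = np.array([
    [ 2.0,  0.0, np.log10(2 * np.pi)],          # 1 total area of flat faces
    [ 2.0,  1.0, np.log10(delta * np.pi)],      # 2 mass
    [ 1.0,  1.0, np.log10(2 * np.pi)],          # 3 area of curved surface
    [ 4.0,  1.0, np.log10(delta * np.pi / 2)],  # 4 axle moment of inertia
    [ 1.0, -1.0, 0.0],                          # 5 aspect ratio
    [ 2.0,  1.0, np.log10(4.0)],                # 6 circumscribed square prism
    [-2.0,  1.0, np.log10(rho / np.pi)],        # 7 resistance between faces
])

# z[i, j] = (coefficient of log r) * log r_j + (coefficient of log h) * log h_j + constant
z = (np.outer(coef[:, 0], np.log10(r))
     + np.outer(coef[:, 1], np.log10(h))
     + coef[:, 2][:, None])
print(z.shape)
```

Each row of `z` is a linear function of log r and log h, so the array has exactly the two-mode form of eq. 5 that the analysis is expected to recover.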

The most time-consuming phase of the analysis is the iterative adjustment of
the parameters until they satisfy eq. 1. In the first half of each cycle we use
multiple linear regression to calculate $a_i$, $b_i$, and $c_i$ from the observed data
and the current factor values (initially random numbers); in the next half cycle
we use multiple linear regression to calculate factors from the data and the
i-subscripted parameters. Further details are given under "Phase 1 Details".
It is not necessary to incorporate any subsidiary conditions prior to convergence.
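
The alternating regressions can be sketched as follows in Python with NumPy. This is only an illustration under stated assumptions: it uses `numpy.linalg.lstsq` in place of hand-written simultaneous equations, runs a fixed number of cycles instead of testing for convergence, omits any acceleration, and is exercised on synthetic two-mode data with an arbitrary pattern of missing entries. The function name `dove_phase1` and all numerical values are hypothetical.

```python
import numpy as np

def dove_phase1(z, e, n_cycles=2000, seed=0):
    """Alternating regressions for p_ij = a_i x_j + b_i y_j + c_i.

    z : (u, v) data array (entries where e == 0 are ignored and may be NaN)
    e : (u, v) array of 0/1 existence flags
    Returns a, b, c (length u) and x, y (length v).
    """
    u, v = z.shape
    obs = e.astype(bool)
    # Weight each row by the reciprocal variance of its observed data.
    w = np.array([1.0 / np.var(z[i, obs[i]]) for i in range(u)])
    rng = np.random.default_rng(seed)
    x, y = rng.normal(size=(2, v))            # random starting factors
    a, b, c = np.zeros((3, u))
    for _ in range(n_cycles):
        # First half cycle: regress each row's data on the current factors.
        for i in range(u):
            k = obs[i]
            X = np.column_stack([x[k], y[k], np.ones(k.sum())])
            a[i], b[i], c[i] = np.linalg.lstsq(X, z[i, k], rcond=None)[0]
        # Second half cycle: weighted regression of each column on the slopes.
        for j in range(v):
            k = obs[:, j]
            sw = np.sqrt(w[k])
            S = np.column_stack([a[k], b[k]]) * sw[:, None]
            x[j], y[j] = np.linalg.lstsq(S, (z[k, j] - c[k]) * sw, rcond=None)[0]
    return a, b, c, x, y

# Synthetic two-mode data (hypothetical values) with 14 of 70 entries missing.
rng = np.random.default_rng(1)
u, v = 7, 10
z_full = (np.outer(rng.normal(size=u), rng.normal(size=v))
          + np.outer(rng.normal(size=u), rng.normal(size=v))
          + rng.normal(size=u)[:, None])
e = np.ones((u, v), dtype=int)
for i in range(u):
    e[i, i] = 0
    e[i, (i + 5) % v] = 0

a, b, c, x, y = dove_phase1(np.where(e == 1, z_full, np.nan), e)
p = np.outer(a, x) + np.outer(b, y) + c[:, None]
print("largest error on observed data:", np.abs(p - z_full)[e == 1].max())
print("largest error on missing data: ", np.abs(p - z_full)[e == 0].max())
```

Each half cycle is an exact least squares solution for one block of parameters with the other block held fixed, so the criterion of eq. 1 cannot increase from one half cycle to the next.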

The  slope  parameters  before  phase  2 as  we  obtained  them  from  70  and  from  50 
data  are  shown  in  Table  4.  They  are  complicated  hybrid  functions  of  the  real 
sensitivities  to  radii  and  heights.  Phase  2 unscrambles  them  to  give  simple  direct 
measures  of  these  sensitivities. 

For the phase 2 transformation, we arbitrarily choose $x_5 = 0$, $y_5 = 0$, $a_1 = 1$, and
$b_3 = 1$ as the four trivial conditions. Therefore, the factors will become differences
above or below those of the reference jot, cylinder 5, taken as zero, while
first-mode slopes will become ratios relative to that of property 1 as a
standard, and second-mode slopes will become ratios relative to that of property 3,
another standard.

For the first of the two critical conditions we specify $b_1 = 0$, reflecting
possible insight that face areas of cylinders are independent of cylinder height
(or of the second-term factor), even though we have not yet deduced the functional
form of their dependence on the first-term factor nor yet determined any factors
quantitatively from the analysis.

For the second of the critical conditions (the last condition) we choose
$a_3 = a_1/2$ (or its equivalent, $a_3 = 1/2$, since $a_1 = 1$ is used as a trivial condition),
because of three convictions: (1) that total flat face area (i=1) is just a
multiple $A_1$ of the circular area of one end (although we need not even know that the
multiplier is 2); (2) that curved area (i=3) is proportional to circumference
with a proportionality constant $A_2$ that is independent of radius but is a function
of height (although we need not know what function it is, i.e., that $A_2$ equals
height itself); (3) that circumference is proportional to the square root of circular
area (although we need not know the dependence of either on radius, nor that the
proportionality constant $A_3$ is $2\sqrt{\pi}$). These statements underlying the critical
conditions can be reasonably inferred from simple theoretical considerations or from
suitable observations of another kind; they cannot be deduced uniquely from the input
data of Table 2. Combining, taking logarithms, and using eq. 5 for both ilks, we obtain
$a_3 x_j = a_1 x_j/2 + A_4$, where $A_4$ is independent of the first factor (although it
includes $b_1$, $b_3$, $c_1$, $c_3$, $A_1$, $A_2$, and $A_3$). Since this identity must hold for all
values of $x_j$, it follows that $A_4 = 0$ and $a_3 = a_1/2$.

Alternatively and equivalently, we could replace the last condition by
$a_7 = -a_1$ (or $a_7 = -1$), reflecting either a theory or observations that wire
cross-section and resistance are inversely related, even though we have not yet
deduced a complete formula for either. Although one might expect that another
alternative for the last condition could be $a_2 = a_6$, implying that radius is equally
influential in affecting masses or volumes, that condition is found to be ineffective
for separating the factors (because the data for i=2 and i=6 also have the same
dependence on height). Other undesirable assumptions are orthogonality or zero
covariance between the factors because they give wrong answers in this
cylinder problem, and are unlikely to be satisfied by any small sample.

Substitution of the chosen set of six conditions into transformation equations
10-13 gives

$$x_5 = 0 = t_{11} x_5' + t_{12} y_5' + t_{13}$$

$$y_5 = 0 = t_{21} x_5' + t_{22} y_5' + t_{23}$$

$$a_1 = 1 = (t_{22} a_1' - t_{21} b_1')/\det$$

$$b_1 = 0 = (t_{11} b_1' - t_{12} a_1')/\det$$

$$b_3 = 1 = (t_{11} b_3' - t_{12} a_3')/\det$$

$$a_3 = 1/2 = (t_{22} a_3' - t_{21} b_3')/\det$$

Solution of these six simultaneous equations gives the six t values, which may
be substituted back into transformation equations 10-15 to give transformed
parameters consistent with these six conditions. This transformation converts
the previous parameters from either 70 or 50 data to the desired pure parameters
shown in Fig. 2 and listed in the last columns of Table 4.
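
One way to solve the six conditions, sketched here in Python with NumPy, is to note that the four slope conditions are linear in the inverse of the upper-left 2x2 block of T, after which the two reference conditions give $t_{13}$ and $t_{23}$ directly. The function name `solve_T`, the zero-based index arguments, and the scrambled test parameters below are assumptions for illustration; the authors derived explicit transformation equations instead.

```python
import numpy as np

def solve_T(a, b, x, y, ref_jot=4, ilk1=0, ilk3=2):
    """Find T for x_5 = y_5 = 0, a_1 = 1, b_1 = 0, a_3 = 1/2, b_3 = 1.

    a, b : phase 1 slopes per property; x, y : phase 1 factors per cylinder.
    Indices are zero-based, so cylinder 5 is 4 and properties 1 and 3 are 0 and 2.
    """
    # Slope conditions: rows (a', b') times A^-1 must equal the target rows,
    # where A is the upper-left 2 x 2 block of T.
    S_old = np.array([[a[ilk1], b[ilk1]], [a[ilk3], b[ilk3]]])
    S_new = np.array([[1.0, 0.0], [0.5, 1.0]])
    A = np.linalg.inv(np.linalg.solve(S_old, S_new))
    # Reference conditions: A (x', y')^T + (t13, t23)^T = 0 for the reference jot.
    shift = -A @ np.array([x[ref_jot], y[ref_jot]])
    T = np.eye(3)
    T[:2, :2] = A
    T[:2, 2] = shift
    return T

# Pure slopes for the 7 cylinder properties on the scale a_1 = 1, b_3 = 1.
a_true = np.array([1.0, 1.0, 0.5, 2.0, 0.5, 1.0, -1.0])
b_true = np.array([0.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0])
rng = np.random.default_rng(0)
x_true, y_true = rng.normal(size=(2, 10))     # hypothetical factors
x_true -= x_true[4]
y_true -= y_true[4]

# Scramble with an arbitrary T0 to imitate an unconstrained phase 1 result.
T0 = np.array([[1.3, -0.4, 0.2], [0.6, 0.9, -0.5], [0.0, 0.0, 1.0]])
J_old = T0 @ np.vstack([x_true, y_true, np.ones(10)])
I_old = np.column_stack([a_true, b_true, np.zeros(7)]) @ np.linalg.inv(T0)

T = solve_T(I_old[:, 0], I_old[:, 1], J_old[0], J_old[1])
I_new = I_old @ np.linalg.inv(T)
print(np.round(I_new[:, 0], 6))
```

The recovered first-mode slopes equal `a_true`, showing that the six conditions pick out one parameter set from the infinitely many that fit equally well.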

The $a_i$ slopes are all exactly half of the coefficients of log r in Table 1.
The $a_i$ slopes (and also the $b_i$ slopes) are thus in correct ratios relative to one
another. The factor of one-half derives from one of our four less significant
subsidiary conditions, $a_1 = 1$. It merely puts $a_i$ values on a scale relative to $a_1$
for property 1 as unity. In most applications of these numbers only relative
values are needed, so it does not matter that these relative $a_i$ have only half
of their absolute values. A slightly different way of viewing the effect of the
trivial condition $a_1 = 1$ is to say that it makes the $x_j$ factors be 2 log r (or log $r^2$)
instead of log r, i.e., measures of flat face area rather than of radius. This is
a trivial difference because radius, diameter, circumference, and flat face area
are all measures of the one influence that the first factor is taken
to be representing, so the choice among them can be arbitrary.

Regardless of the choice of trivial conditions, the use of valid critical
conditions lets us deduce that mass has the same dependence on radius as flat face
area (since $a_2 = a_1$), moment of inertia is twice as dependent on radius (since $a_4 = 2a_1$)
and all the properties except 1 and 5 show the same dependence on height. Such
deductions about relative factors and slopes and the functional forms of the
properties are as detailed as one could expect from any kind of least squares procedure.
This result obtained either without or with missing data thus seems useful and
quite satisfactory.

Phase 1 Details

Least squares (1) is simply a mathematical method of fitting data, invoked
because data are generally imperfect. Data that are believed to be products
of various unknown powers of the factors should be linearized by taking logarithms,
as illustrated in our example. Prior recognition or evidence for a linear
relationship is not a prerequisite for a valid DOVE analysis, but can often simplify
it by decreasing n.


Let the number of different ilks (i) be u, the number of different jots (j)
be v, and the number of observed data ($z_{ij}$) be d. Usually data are available for
only a small fraction of the maximum of u times v combinations, but all that are
available and believed to be reliable should be used in the analysis. To
distinguish between data to be used and data that are missing or rejected, we make
$e_{ij}$ unity if the corresponding $z_{ij}$ is to be used, otherwise zero. Although we have
not used weights to reflect differences in precision or reproducibility of different
data (because many measurements are made or reported only once) nor accuracy
(because that is even harder to evaluate), we do use weights to make the final
correlation coefficient for an ilk independent of its
range. Therefore we equate each weight $w_i$ in eq. 1 to the reciprocal of variance
of the $z_{ij}$ data for all j from the mean of $z_{ij}$ for that ilk.

$$w_i = \left[\operatorname{var}_j\,(z_{ij})\right]^{-1} \qquad (16)$$


For eq. 5, simultaneous equations of forms 17-19 for each ilk and 20-21 for
each jot

$$a_i \sum_j e_{ij} x_j^2 + b_i \sum_j e_{ij} x_j y_j + c_i \sum_j e_{ij} x_j = \sum_j e_{ij} x_j z_{ij} \qquad (17)$$

$$a_i \sum_j e_{ij} x_j y_j + b_i \sum_j e_{ij} y_j^2 + c_i \sum_j e_{ij} y_j = \sum_j e_{ij} y_j z_{ij} \qquad (18)$$

$$a_i \sum_j e_{ij} x_j + b_i \sum_j e_{ij} y_j + c_i \sum_j e_{ij} = \sum_j e_{ij} z_{ij} \qquad (19)$$

$$x_j \sum_i e_{ij} w_i a_i^2 + y_j \sum_i e_{ij} w_i a_i b_i = \sum_i e_{ij} w_i a_i (z_{ij} - c_i) \qquad (20)$$

$$x_j \sum_i e_{ij} w_i a_i b_i + y_j \sum_i e_{ij} w_i b_i^2 = \sum_i e_{ij} w_i b_i (z_{ij} - c_i) \qquad (21)$$

are obtained by substituting eq. 5 into eq. 1 and then setting the 3u + 2v partial
derivatives of eq. 1 with respect to each parameter equal to zero. Beginning with
random numbers for $x_j$ and $y_j$ values, one uses the u sets of three
equations (17-19) in three unknowns to solve for values of $a_i$, $b_i$ and $c_i$.
These are then used in the v sets of two equations (20-21) in two unknowns to
solve for better values of $x_j$ and $y_j$. Thus by the successive approximation method
of solving these equations alternately, one of the infinite number of sets of values
for the 3u + 2v constants consistent with eq. 1 and 5 is finally obtained. This
particular converged set, dependent on the initial random numbers, is then
transformed in phase 2 into the unique set consistent with the desired six subsidiary
conditions.

All calculations involving real numbers were done to a precision of 16
decimal places, using a FORTRAN IV program on an IBM 370-168 computer. While
conforming logically to the above description, our program obviated storage of both
missing data and the existence matrix by use of three simple arrays for measured
data, i, and j, each singly subscripted by only a measured datum number (k = 1, 2, ..., d),
and by searches, when needed, through these arrays.
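
A storage scheme of this kind can be sketched in Python with NumPy as three parallel arrays indexed by datum number. The array names, the sample values, and the two helper functions below are hypothetical and serve only to illustrate the idea; they do not reproduce the FORTRAN program.

```python
import numpy as np

# Hypothetical observations: datum k has value z_val[k], ilk i_idx[k], jot j_idx[k].
z_val = np.array([0.80, 1.10, -0.30, 0.45, 2.05])
i_idx = np.array([0, 0, 1, 1, 1])
j_idx = np.array([0, 2, 0, 1, 2])

def residual_sum(z_val, i_idx, j_idx, a, b, c, x, y, w):
    """Eq. 1 summed over measured data only; no existence matrix is stored."""
    p = a[i_idx] * x[j_idx] + b[i_idx] * y[j_idx] + c[i_idx]
    return float(np.sum(w[i_idx] * (z_val - p) ** 2))

def data_for_ilk(i):
    """Search the arrays for all measured data belonging to ilk i."""
    k = np.flatnonzero(i_idx == i)
    return j_idx[k], z_val[k]

jots, values = data_for_ilk(1)
print(jots, values)
```

Only d entries are held in each array, so the memory needed grows with the number of measured data rather than with the full u times v grid.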

Convergence was achieved in a smaller number of iterative cycles by appropriate
use of overrelaxation of all the i-subscripted parameters (e.g., making changes
in them larger than calculated by multipliers of 1.6 or more in most cycles)
and by less frequent but longer extrapolations of all the factors (e.g., changing
each by a common large multiple of its total change in the last one or two cycles,
every 15-30 cycles). Such techniques are generally required for a practical solution.
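
The two acceleration devices can be written as one-line updates, sketched here in Python with NumPy. The function names, the default extrapolation multiple of 10, and the sample numbers are assumptions; only the overrelaxation multiplier of 1.6 comes from the text.

```python
import numpy as np

def overrelax(old, calculated, multiplier=1.6):
    """Enlarge the calculated change in a parameter by a fixed multiplier."""
    return old + multiplier * (calculated - old)

def extrapolate(current, earlier, multiple=10.0):
    """Extend the total change since `earlier` by a common large multiple."""
    return current + multiple * (current - earlier)

old = np.array([1.0, 2.0])
calculated = np.array([2.0, 2.5])
print(overrelax(old, calculated))
print(extrapolate(np.array([1.0]), np.array([0.9])))
```

Because an extrapolation can overshoot, it is natural to keep a copy of the parameters from just before the step so that they can be restored if the fit deteriorates.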
The square of the correlation coefficient, deco,

$$\text{deco} = 1 - \frac{\sum_i \sum_j e_{ij}\, w_i\, (z_{ij} - p_{ij})^2}{d + 6 - 3u - 2v}$$

was calculated just before each extrapolation and two cycles
later, and constancy to 12 decimal places used as a criterion of convergence. Any
extrapolation resulting in a decreased deco two cycles later was effectively
erased by return to the parameters existing just prior to extrapolation.

Deco corrects for sample size (through degrees of freedom) and for dissimilar data
ranges or unit sizes in different ilks (through weights as described above). After
convergence, it represents the fraction of the variation in $z_{ij}$ attributable to
variations in the parameters explicitly included in the expression chosen for $p_{ij}$,
as opposed to errors or unidentified factors.

Many more cycles are needed if a large fraction of the possible data are missing.
For example, in this cylinder problem we reached convergence in 1 cycle when 70
data were used, within 25 cycles when 50 data were used (deleting the 20 indicated
by stars in Table 2), but only by 321, 350, and 417 cycles when 42, 41 and 40
data were used (8). In the last case, the data are fewer than the number of
parameters determined (21 + 20 = 41), but the subsequent transformation adds
6 subsidiary conditions to the 40 data, thereby providing sufficient information
to make all the final parameters unique and meaningful.

Non-linear least squares based on the Marquardt algorithm (9) is an
alternative method for fitting all the parameters in these or any equations,
linear or otherwise. Unfortunately, computer execution time for each cycle is
extremely long by the Marquardt procedure when the number of parameters exceeds
10, being proportional to the cube of the total number of parameters, in
striking contrast to DOVE, where it is proportional to the first power. In the
present example with 41 parameters, total execution time for convergence is 7 sec.
with 70 data and 51 sec. with 69 data (only one missing datum), vs. DOVE execution
times of 0.44 sec. with 70 data, 1.4 sec. with 50 data, and 12 sec. with 40
data (30 missing data).


Phase 2 Details

Normalization conditions for the factors ($\sum_j x_j^2 = v$ and $\sum_j y_j^2 = v$) are less
convenient for trivial conditions than selection of values for particular jots,
because they change the sizes of the units in which parameters are expressed every
time data are added or deleted.

For  any  set  of  subsidiary  conditions  to  be  acceptable,  the  determinant  of  T 
must  be  nonzero.  In  the  course  of  deriving  transformation  equations  for  more 
than  100  different  sets  of  six  subsidiary  conditions  (appropriate  to  eq.  5 with 
different  kinds  of  problems),  we  have  found  it  always  wise  to  check  this  early 
in  the  derivation.  With  many  possible  sets  of  subsidiary  conditions,  it  is  much 
easier  to  do  two  or  more  successive  transformations,  incorporating  some  but  not 
all  of  the  desired  conditions  in  the  first  transformation.  Substitution  of  the 
values  of  previously  fixed  parameters  often  considerably  simplifies  the  derivation 
of  equations  for  subsequent  transformations. 

DOVE has an enormous potential in many fields for correctly interpreting data
expressible by equation 5, and a potential unmatched by any other method when some
of the data are missing. Phase 1 should also be applicable to problems involving
three or more modes. However, with three or more modes, the phase 2 problem of
finding, justifying and incorporating the required large number of critical conditions
becomes a major obstacle in DOVE or any linear least squares, nonlinear least
squares, principal components or other factor analysis procedure purporting to
provide meaningful parameters. Orthogonalization between factors of different
modes has often been used, but the number of such conditions is insufficient,
orthogonality is never satisfied by real data from statistically small samples,
and often would not be satisfied even by infinite-size samples when the factors
have real physical significance. For example, for substituent effects in chemistry,
the resonance and other electronic (field and inductive) factors associated with
substituents certainly have a small but undoubtedly physically meaningful positive
correlation; and for solvent effects, the two types of factors associated with
the solvent (anion-stabilizing ability and cation-stabilizing ability) clearly
have a weak but significant negative correlation. To assume that they do not
would force the derived factors to take on numerical values that are complicated
hybrids rather than pure measures of these physical characteristics. Unless the
proper number of meaningful and valid critical conditions is incorporated, the
factors and slopes have no simple interpretation or meaning, even though all
predicted data $p_{ij}$ may agree very accurately with observed data $z_{ij}$.


Summary

DOVE can be useful, even when many or most data are missing, for (1) generalized
least squares fitting to evaluate a self-consistent set of all parameters in an
expression for predicting all missing data, and (2), without changing the predicted
data, to transform the set of parameters obtained in phase 1 so that each final
parameter has a simple, pure, realistic, physical meaning. Since predicted data
are expressed as $s_{i1}f_{1j} + s_{i2}f_{2j} + \ldots + c_i$ with n product terms, phase 2 requires
incorporation of $n^2+n$ independent subsidiary conditions, of which 2n are arbitrary,
i.e., merely fix zero reference points and scale unit sizes, but $n^2-n$ are
critical, i.e., must be relationships between particular parameters supported by
other information. Both phases are illustrated by a two-mode application with 7 i,
10 j, hence 41 parameters, to fit the data plus the 6 subsidiary conditions.
Valid parameters are obtained although 30 of the 70 possible data are missing.

Table 1. Cylinder properties used.

                                                   Formula to be deduced
ilk  Property observed                          nonlinear        linear log form
1    total area of flat faces                   2 pi r_j^2       2 log r_j + log(2 pi)
2    mass; delta = 10 g/cm^3                    delta pi r_j^2 h_j      2 log r_j + log h_j + log(delta pi)
3    area of curved surface                     2 pi r_j h_j     log r_j + log h_j + log(2 pi)
4    axle moment of inertia                     delta pi r_j^4 h_j / 2  4 log r_j + log h_j + log(delta pi / 2)
5    aspect ratio                               r_j / h_j        log r_j - log h_j
6    volume of circumscribed square prism       4 r_j^2 h_j      2 log r_j + log h_j + log 4
7    resistance between faces; rho = 0.1 ohm cm rho h_j / (pi r_j^2)    -2 log r_j + log h_j + log(rho / pi)


Table 3. A group of random numbers used to generate Table 2.

Cylinder, j   Radius, r_j    Height, h_j    log(r_j/r_5)   log(h_j/h_5)
 1            0.021658190    0.408169508    -1.61          -0.36
 2            0.543617487    0.456423521    -0.21          -0.31
 3            0.030803025    0.153893471    -1.46          -0.78
 4            0.521428823    0.799775958    -0.23          -0.07
 5            0.890675187    0.930589199    (0.00)         (0.00)
 6            0.796412110    0.968748212    -0.05           0.02
 7            0.190720081    0.059738331    -0.67          -1.19
 8            0.923573375    0.606675744     0.02          -0.19
 9            0.175991654    0.081358790    -0.70          -1.06
10            0.358400822    0.323721051    -0.40          -0.46

Table  2.  Input  data  set  used  to  test  various  procedures. 

Table 4. Slopes, a_i and b_i


* After convergence to meet the least squares conditions but before parameter
transformations to incorporate six subsidiary conditions.

† Relative values after incorporation of six subsidiary conditions.

‡ Value specified by one of the six subsidiary conditions.

§ Number of input data used in the analysis.

Legend for Figure 1


Figure 1. Shapes of cylinders 1, 3, and 9 in the example. This synthetic
but illustrative problem uses data on 7 measurable properties of these and
7 other cylinders to deduce factors that are pure measures of radius or
height for each cylinder (Figure 2), and also correct relative sensitivities
to these factors for each property (Table 4).


Legend for Figure 2


Figure 2. A plot of y_j factors vs. x_j factors calculated by DOVE, showing
that it is superimposable on a plot of relative log height, log h_j - log h_5,
vs. relative log radius, log r_j - log r_5, for 10 cylinders. These factors
are calculated from either a complete (70) or partial (50 or 40) set of
logarithmic data (Table 2) on the 7 properties listed in Table 1.


